✨ Support `taxa="plants"` in `EnsemblGene` #153

mossjacob · 2024-11-12T14:34:31Z

Some use cases, such as adding plants, require a keyword argument in EnsemblGene that specifies the kingdom.

sunnyosun · 2024-11-12T14:40:23Z

Could you check if this is the correct db?

bionty/bionty/base/entities/_gene.py

Line 104 in e75518c

f"mysql+mysqldb://anonymous:@ensembldb.ensembl.org/{self._organism.core_db}"

I think for plants it's a different db.

mossjacob · 2024-11-12T15:05:11Z

I think it should be the same database (although it is not working). See https://plants.ensembl.org/info/data/mysql.html

Ensembl Genomes databases from all five divisions are located on the same server

and

The following conventions apply:
core databases - <genus_species>_core_<eg_version>_<ensembl_version>_<assembly_version>

But I can't locate, for example, a database such as

arabidopsis_thaliana_core_57_10

which should be there, as it is a model organism.
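As a rough illustration of the naming convention quoted above, here is a minimal sketch that splits a core database name into its parts. The helper `parse_core_db` and the regular expression are assumptions made for this example and are not part of bionty or Ensembl tooling. Note that a name with only two version fields, like the one above, does not match the convention as quoted.

```python
import re

# Pattern for the convention quoted above (an assumption for illustration):
# <genus_species>_core_<eg_version>_<ensembl_version>_<assembly_version>
CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+?)_core_"
    r"(?P<eg_version>\d+)_(?P<ensembl_version>\d+)_(?P<assembly_version>[0-9a-z]+)$"
)


def parse_core_db(name: str) -> dict:
    """Split a core database name into its components (hypothetical helper)."""
    match = CORE_DB_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the core database convention")
    return match.groupdict()


# A name carrying all three version fields parses cleanly.
parts = parse_core_db("arabidopsis_thaliana_core_60_113_11")

# A name with only two version fields is rejected by the pattern.
try:
    parse_core_db("arabidopsis_thaliana_core_57_10")
    two_field_name_matches = True
except ValueError:
    two_field_name_matches = False
```

This only checks the shape of the name. Whether a given database actually exists on the server still has to be verified against the server itself.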

sunnyosun · 2024-11-12T15:12:20Z

Could you try mysql+mysqldb://anonymous:@ensembldb.ensembl.org/plants/{self._organism.core_db}?

mossjacob · 2024-11-12T16:09:55Z

That also gives an Unknown database error, unfortunately!

mossjacob · 2024-11-12T16:45:02Z

This is particularly odd: loading from mysql+mysqldb://anonymous:@ensembldb.ensembl.org works for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/vertebrates/mysql/

but not for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/plants/mysql

mossjacob · 2024-11-12T17:05:33Z

Found the fix. Needed to add the right port for some reason:

mysql+mysqldb://anonymous:@ensembldb.ensembl.org:4157/arabidopsis_thaliana_core_60_113_11

It seems the plant and the vertebrates databases use different ports. I will alter this PR to include a check.
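The kind of check described here could look roughly like the sketch below. It is not the code from this PR: the function `build_core_db_url`, the `PORTS` mapping, and the placeholder database name are assumptions for illustration. Only the host and port 4157 for plants come from this thread; for vertebrates the port is simply omitted, as in the original URL.

```python
HOST = "ensembldb.ensembl.org"

# Port per division. 4157 for plants is the value found in this thread.
# None means "omit the port", which matches the original vertebrates URL.
PORTS = {"plants": 4157, "vertebrates": None}


def build_core_db_url(core_db: str, taxa: str = "vertebrates") -> str:
    """Build the SQLAlchemy URL for a core database (hypothetical helper)."""
    if taxa not in PORTS:
        raise ValueError(f"Unknown taxa {taxa!r}; expected one of {sorted(PORTS)}")
    port = PORTS[taxa]
    netloc = HOST if port is None else f"{HOST}:{port}"
    return f"mysql+mysqldb://anonymous:@{netloc}/{core_db}"


plants_url = build_core_db_url("arabidopsis_thaliana_core_60_113_11", taxa="plants")
# "example_core_db" is a placeholder, not a real database name.
default_url = build_core_db_url("example_core_db")
```

Keeping the ports in one mapping means further divisions can be added in a single place once their ports are confirmed.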

sunnyosun · 2024-11-12T17:06:17Z

What about using mysql-eg-publicsql.ebi.ac.uk instead of ensembldb.ensembl.org? It's the first option here: https://plants.ensembl.org/info/data/mysql.html

mossjacob · 2024-11-12T17:53:44Z

Reference for the port fix

mossjacob · 2024-11-13T11:30:31Z

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

Zethson · 2024-11-13T12:20:38Z

@mossjacob I'll look into this. I'll report back

sunnyosun · 2024-11-13T12:33:21Z

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

Ah, this is similar to the yeast case, could you take a look here? https://bionty-assets-gczz.netlify.app/ingest/gene-ensembl-release-112#saccharomyces-cerevisiae

mossjacob · 2024-11-13T13:39:35Z

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:

gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)

it looks for the field ensembl_gene_id, which doesn't exist for these tables, at line 241 of _from_values.py in lamindb:

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)


Zethson · 2024-11-14T12:11:35Z

Dear @mossjacob,

sorry, I'm still catching up.

Please don't be confused by the failing CI. This is an issue with forked repositories not having access to our secrets. It's on my TODO list to change this.
I pushed a few changes to your PR that refactor it a bit and also fix a few issues with the download of the Gene ontologies. I am surprised that you got it to run with the current code, because with the latest versions of the dependencies it no longer worked for me.
I created a PR that demonstrates the usage (✨ Arabidopsis thaliana release 57, bionty-assets#41). I guess this is also where you ended up. See: https://6735e98c1a52cf5bf1340a87--bionty-assets-gczz.netlify.app/ingest/plant-gene-ensembl-release-57

After lunch, I will look into your last issue. I'll report back!


Zethson · 2024-11-14T13:32:51Z

Concerning

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:
gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)
it looks for the field ensembl_gene_id, which doesn't exist for these tables, at line 241 of _from_values.py in lamindb:

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)

@sunnyosun made me aware that this is a current limitation of from_source, which does not support stable_id. I'll open an issue for this. What works for saccharomyces cerevisiae (which is similar to your case, as you noted above) is the following:

!lamin init --storage run-tests --schema bionty

import lamindb as ln
import bionty as bt

# The instance is empty. Therefore, we add saccharomyces cerevisiae
bt.Organism.from_source(ontology_id="NCBITaxon:559292").save()

# Save all gene records to the instance
genes = bt.Gene.from_values(
    bt.Gene.public(organism="saccharomyces cerevisiae").df()["stable_id"],
    field="stable_id",
    organism="saccharomyces cerevisiae",
)
ln.save(genes)

# Look at our new genes
bt.Gene.df()

Does this help you? Edit: Sorry for closing - I fat fingered the wrong button.

mossjacob · 2024-11-14T13:35:55Z

Hi! Thank you for this!
I will try this out at some point today or tomorrow. For now I was using:

prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id

which I know is not ideal!
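If a temporary override like this is needed, one way to make it safer is to wrap it in a context manager so the original value is restored even when the import raises. The sketch below is generic Python and uses a stand-in class instead of bt.Gene; `temporary_attr` and `FakeGene` are hypothetical names introduced only for this example.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the with-block and restore it afterwards."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and on exceptions, so the override never leaks.
        setattr(obj, name, previous)


class FakeGene:
    """Stand-in for a model class with a private ID-field setting (assumption)."""

    _ontology_id_field = "ensembl_gene_id"


with temporary_attr(FakeGene, "_ontology_id_field", "stable_id"):
    inside = FakeGene._ontology_id_field  # override is active here

after = FakeGene._ontology_id_field  # original value is back
```

With the real class, the import call would go inside the with-block. This remains a workaround around a private attribute, not a supported API.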

sunnyosun · 2024-11-14T13:49:04Z

Hi! Thank you for this! I will try this out at some point today or tomorrow. For now I was using:
prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id
which I know is not ideal!

It's super cool that you figured this out even though _ontology_id_field is not user-facing at all!

Then no need to try from_values, we'll make a proper fix for import_from_source!

mossjacob · 2024-11-14T14:05:41Z

I enjoy a good debug :)
Thanks!

Zethson · 2024-11-14T14:21:47Z

Okay, so apparently pandas 2.2 is not compatible with SQLAlchemy 1.4, which I still had on my PC. I have now reverted the changes I made earlier to the SQL statements to work around that.

I'll make the CI run on this PR soon and then we can consider merging this.

Would you like us to also add genes for some plant organisms, such as Arabidopsis thaliana, to Bionty so that it works out of the box for you?


falexwolf · 2024-11-14T14:29:54Z

sqlalchemy < 2 is no good to use anymore! 😇 😆

falexwolf · 2024-11-14T14:30:49Z

Impressively low-level contributions! @mossjacob 😄

mossjacob · 2024-11-14T15:19:44Z

Thanks everyone. Re adding to bionty-assets, while that would be nice, I envisage using quite a few different species so adding all to bionty-assets may be overkill at this point?

In this PR there's a for loop for adding multiple organisms, and I'm also using the code below to download on the fly:

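import bionty as bt
# Import path assumed from the file referenced earlier in this thread
# (bionty/bionty/base/entities/_gene.py).
from bionty.base.entities._gene import EnsemblGene


def verify_organism_exists(organism, version="release-57"):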
    if bt.Source().filter(organism=organism).count() == 0:
        # Try syncing
        bt.core.sync_all_sources_to_latest()
        if bt.Source().filter(organism=organism).count() == 0:
            # If the source still does not exist, then download it.
            print("Organism does not exist in bionty.")
            print(f"Attempting to download {organism}...")
            ensembl_gene = EnsemblGene(organism=organism, version=version, kingdom="plants")
            print("URL:", ensembl_gene._url)
            df = ensembl_gene.download_df()
            df["description"] = df["description"].str.replace(r"\[.*?\]", "", regex=True)
            filename = f"df_{organism}__ensembl__{version}__Gene.parquet"
            df.to_parquet(filename)
            print(f"Downloaded {organism} to {filename}")
            raise ValueError(f"Add '{filename}' to sources_local.yaml and run bt.core.sync_all_sources_to_latest()")

With the change I made in this other PR, the URL to the local parquet file created can be added to sources_local.yaml and synced.

[edit] the code has been updated
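To make the description clean-up in the snippet above concrete, here is a small standard-library sketch of what the pattern `\[.*?\]` removes. The sample strings are made up for illustration, and `strip_bracketed` is a hypothetical helper. With `regex=True`, pandas applies the same regular-expression semantics to each element of the column.

```python
import re

# Non-greedy: each [...] group is removed on its own.
BRACKETED = re.compile(r"\[.*?\]")


def strip_bracketed(description: str) -> str:
    """Remove bracketed annotations and trim leftover whitespace (hypothetical helper)."""
    return BRACKETED.sub("", description).strip()


raw = "Example protein [Source:ExampleDB;Acc:X0000]"  # made-up description

# The bare substitution leaves a trailing space behind.
without_strip = BRACKETED.sub("", raw)

# Adding .strip() tidies that up.
cleaned = strip_bracketed(raw)

# A greedy pattern would swallow everything between the first "[" and the last "]".
greedy = re.sub(r"\[.*\]", "", "a [x] b [y]")
non_greedy = BRACKETED.sub("", "a [x] b [y]")
```

The trailing space is harmless for most uses, but stripping it keeps descriptions comparable across sources.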

mossjacob · 2024-11-14T15:53:39Z

Slightly altered the way gene tables are onboarded: the check that the ensembl_gene_id column consists only of ENS-prefixed IDs is quite strict. For example, for rice (Oryza sativa), some IDs are prefixed by ENS (these seem to be mostly RNA), while protein-coding genes are prefixed by "Os". Without this change, all genes are removed from the df in the else clause.
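The effect of an all-or-nothing prefix check on a mixed column can be sketched in plain Python. The IDs below are placeholders in the style described above, and `relaxed_check` is only one possible relaxation; it is not necessarily what this PR implements.

```python
from collections import Counter

# Placeholder IDs for illustration: one ENS-prefixed, two "Os"-prefixed.
ids = ["ENSRNA000000001", "Os01g0100100", "Os01g0100200"]


def strict_check(values) -> bool:
    """True only if every ID starts with ENS."""
    return all(v.startswith("ENS") for v in values)


def relaxed_check(values) -> bool:
    """True if at least one ID starts with ENS (one possible relaxation)."""
    return any(v.startswith("ENS") for v in values)


def prefix_summary(values) -> Counter:
    """Count how many IDs are ENS-prefixed versus anything else."""
    return Counter("ENS" if v.startswith("ENS") else "other" for v in values)


strict = strict_check(ids)
relaxed = relaxed_check(ids)
summary = prefix_summary(ids)
```

A summary like this is a quick way to see whether a column is genuinely mixed before deciding which branch of the onboarding logic it should take.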


Zethson · 2024-11-15T13:13:11Z

Great @mossjacob! Thank you very much for your enthusiasm and contributions.

1. I renamed kingdom to taxa. Is that fine with you, or would you prefer kingdom?
2. There are a couple of follow-up issues here that I would like to tackle in the future.
2.1 Making this less Ensembl-focused, since even the class is named after Ensembl. We can generalize this better. This includes the issue "import_source currently only supports ensembl_gene_id but not stable_id or other IDs" (#160).
2.2 Currently the code mixes organism and taxa in an odd way. We were overloading the organism parameter to handle both, which doesn't really make sense. I would like to decouple the two more clearly and get rid of the tiny hack that I introduced in this PR.

I am ready to merge the PR now unless you want to keep building here? We'll also merge your sister PR for local parquet files then.

mossjacob · 2024-11-19T09:13:37Z

Hi @Zethson, I am also ready for this to be merged now! I still have to write a test for the local parquet file PR, though.
Many thanks


add option to specify kingdom

e75518c

mossjacob mentioned this pull request Nov 12, 2024

Added source does not show up in lamindb instance #152

Closed

fix port for plants

a1d2d6a

sunnyosun changed the title ~~Adds option to specify plants or vertebrates when loading an organism's genes~~ Adds option to specify kindeom=plants in EnsemblGene Nov 12, 2024

sunnyosun changed the title ~~Adds option to specify kindeom=plants in EnsemblGene~~ Support kindeom=plants in EnsemblGene Nov 12, 2024

sunnyosun changed the title ~~Support kindeom=plants in EnsemblGene~~ ✨ Support kindeom=plants in EnsemblGene Nov 12, 2024

sunnyosun changed the title ~~✨ Support kindeom=plants in EnsemblGene~~ ✨ Support kindeom="plants" in EnsemblGene Nov 12, 2024

sunnyosun changed the title ~~✨ Support kindeom="plants" in EnsemblGene~~ ✨ Support kindom="plants" in EnsemblGene Nov 12, 2024

🎨 Polish

f414717

Signed-off-by: zethson <[email protected]>

sunnyosun changed the title ~~✨ Support kindom="plants" in EnsemblGene~~ ✨ Support kingdom="plants" in EnsemblGene Nov 13, 2024

🎨 Enable SQL queries again

f6758ac

Signed-off-by: zethson <[email protected]>

🎨 Polish

d003058

Signed-off-by: zethson <[email protected]>

Zethson changed the title ~~✨ Support kingdom="plants" in EnsemblGene~~ ✨ Support taxa="plants" in EnsemblGene Nov 14, 2024

Zethson closed this Nov 14, 2024

Zethson reopened this Nov 14, 2024

🎨 Revert sql statement changes

96689fe

Signed-off-by: zethson <[email protected]>

support where some of ensembl_gene_id column are not ensembl gene ids

d4e8674

Zethson added 5 commits November 15, 2024 10:57

✨ Add fork protected CI (laminlabs#157)

6e91ff7

Signed-off-by: zethson <[email protected]>

Merge branch 'main' into add-plants

26a5d1a

🐛 Only configure AWS credentials if not from fork

67252ac

Signed-off-by: zethson <[email protected]>

Merge branch 'main' into add-plants

9ea99ca

🎨 Pre-commit

436954d

Signed-off-by: zethson <[email protected]>

Zethson force-pushed the main branch from b7c84c1 to d0fd679 Compare November 15, 2024 11:05

Merge branch 'main' into add-plants

81a8f9c

Zethson mentioned this pull request Nov 15, 2024

import_source currently only supports ensembl_gene_id but not stable_id or other IDs #160

Open

Zethson mentioned this pull request Nov 19, 2024

Disentangle organism and taxa better #163

Open

Zethson merged commit a0ce2c3 into laminlabs:main Nov 19, 2024
3 checks passed

sunnyosun referenced this pull request Nov 19, 2024

🐛 Revert taxa - organism change

1eefa91

Signed-off-by: zethson <[email protected]>
